{
 "cells": [
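  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hello, TA!\n",
    "\n",
    "### This notebook loads the trained model and can be **run directly**\n",
    "\n",
    "-----\n",
    "\n",
    "*Because of the data-processing steps in the middle, the whole notebook takes about ten minutes to run. Open it in Jupyter Notebook to see the results and output of my last run.*\n",
    "\n",
    "##### Thank you for your hard work, TA"
   ]
  },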
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "E:\\Anaconda3\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "from keras.models import load_model\n",
    "\n",
    "model = load_model('最高分的训练好的模型.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Next, load the preprocessed data: the training set (used to fit the tokenizer) and the test set**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>class</th>\n",
       "      <th>positive</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>index</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>﻿18年结婚 哈哈哈</td>\n",
       "      <td>0</td>\n",
       "      <td>0.900696</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2017最后顿大餐吃完两人世界明年就是三个人一起啦许下生日愿望️希望一家人都能顺利平安健康🏻🏻🏻</td>\n",
       "      <td>1</td>\n",
       "      <td>0.999904</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>意盎然的季节！祝愿大家都生机勃勃，郁郁葱葱！</td>\n",
       "      <td>2</td>\n",
       "      <td>0.736431</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2017 遇见挚友 遇见我老公 结了婚有了小芒果     希望2018也超级美好️</td>\n",
       "      <td>3</td>\n",
       "      <td>0.983905</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2018.1.1</td>\n",
       "      <td>4</td>\n",
       "      <td>0.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2018加油！</td>\n",
       "      <td>5</td>\n",
       "      <td>0.895319</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2018年做一个更加真实的自己。️</td>\n",
       "      <td>3</td>\n",
       "      <td>0.783433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2018年的第一天，完美的错过了一辆公交车。 德州</td>\n",
       "      <td>6</td>\n",
       "      <td>0.934181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2018年目标1.赚钱买房2.谈场恋爱，遇到对的人就结婚3.拥有一副健康的身体4.学会一种乐...</td>\n",
       "      <td>7</td>\n",
       "      <td>0.999799</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2018年第一个假期：元旦，就这么过去了，感冒咳嗽发高烧给这个元旦带来了不一样的节日，好快呀...</td>\n",
       "      <td>8</td>\n",
       "      <td>0.733896</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    text  class  positive\n",
       "index                                                                    \n",
       "0                                             ﻿18年结婚 哈哈哈      0  0.900696\n",
       "1       2017最后顿大餐吃完两人世界明年就是三个人一起啦许下生日愿望️希望一家人都能顺利平安健康🏻🏻🏻      1  0.999904\n",
       "2                                 意盎然的季节！祝愿大家都生机勃勃，郁郁葱葱！      2  0.736431\n",
       "3              2017 遇见挚友 遇见我老公 结了婚有了小芒果     希望2018也超级美好️      3  0.983905\n",
       "4                                               2018.1.1      4  0.500000\n",
       "5                                                2018加油！      5  0.895319\n",
       "6                                      2018年做一个更加真实的自己。️      3  0.783433\n",
       "7                              2018年的第一天，完美的错过了一辆公交车。 德州      6  0.934181\n",
       "8      2018年目标1.赚钱买房2.谈场恋爱，遇到对的人就结婚3.拥有一副健康的身体4.学会一种乐...      7  0.999799\n",
       "9      2018年第一个假期：元旦，就这么过去了，感冒咳嗽发高烧给这个元旦带来了不一样的节日，好快呀...      8  0.733896"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import jieba\n",
    "\n",
    "dff = pd.read_csv(\"./Preprocessed_data/train.csv\",index_col=0)\n",
    "dff['text'] = dff['text'].fillna('')\n",
    "dff.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>class</th>\n",
       "      <th>positive</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>index</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>我是正面哦</td>\n",
       "      <td>0</td>\n",
       "      <td>0.347826</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>爱是恒久忍耐，又有恩慈。爱是不嫉妒，不自夸，不张狂，不轻易发怒。不计算人的恶。凡事包容。凡事...</td>\n",
       "      <td>0</td>\n",
       "      <td>0.496333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>讨厌死了，上班上班上班不停的上班我真的超级累。什么都不干还是超级超级累。</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000422</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>矮马大半夜的放肌肉男不让人睡觉了</td>\n",
       "      <td>0</td>\n",
       "      <td>0.409895</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>谢谢陈先生。</td>\n",
       "      <td>0</td>\n",
       "      <td>0.768959</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>我的2016要早点睡别熬夜</td>\n",
       "      <td>0</td>\n",
       "      <td>0.625607</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>周锐锐哥！爱你</td>\n",
       "      <td>0</td>\n",
       "      <td>0.970187</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>塞尼亚岛</td>\n",
       "      <td>0</td>\n",
       "      <td>0.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>只可惜没能去现场</td>\n",
       "      <td>0</td>\n",
       "      <td>0.100791</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>自从发现这个号都处于一种忍不住不看看了睡不着的状态</td>\n",
       "      <td>0</td>\n",
       "      <td>0.355194</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    text  class  positive\n",
       "index                                                                    \n",
       "0                                                  我是正面哦      0  0.347826\n",
       "1      爱是恒久忍耐，又有恩慈。爱是不嫉妒，不自夸，不张狂，不轻易发怒。不计算人的恶。凡事包容。凡事...      0  0.496333\n",
       "2                   讨厌死了，上班上班上班不停的上班我真的超级累。什么都不干还是超级超级累。      0  0.000422\n",
       "3                                       矮马大半夜的放肌肉男不让人睡觉了      0  0.409895\n",
       "4                                                 谢谢陈先生。      0  0.768959\n",
       "5                                          我的2016要早点睡别熬夜      0  0.625607\n",
       "6                                                周锐锐哥！爱你      0  0.970187\n",
       "7                                                   塞尼亚岛      0  0.500000\n",
       "8                                               只可惜没能去现场      0  0.100791\n",
       "9                              自从发现这个号都处于一种忍不住不看看了睡不着的状态      0  0.355194"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfTest = pd.read_csv(\"./Preprocessed_data/test.csv\",index_col=0)\n",
    "dfTest['text'] = dfTest['text'].fillna('')\n",
    "dfTest.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A little more processing; almost done**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache C:\\Users\\Kai\\AppData\\Local\\Temp\\jieba.cache\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading model cost 0.810 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100000\n",
      "200000\n",
      "300000\n",
      "400000\n",
      "500000\n",
      "600000\n",
      "700000\n",
      "800000\n",
      "0\n",
      "100000\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "E:\\Anaconda3\\lib\\site-packages\\keras_preprocessing\\text.py:178: UserWarning: The `nb_words` argument in `Tokenizer` has been renamed `num_words`.\n",
      "  warnings.warn('The `nb_words` argument in `Tokenizer` '\n"
     ]
    }
   ],
   "source": [
    "def stopwordslist():\n",
    "    # Load the stop words, one per line, stripping the newline and any\n",
    "    # square brackets (ASCII and full-width) from each entry.\n",
    "    stopwords = []\n",
    "    with open(\"./Preprocessed_data/stop.txt\", \"r\") as f:\n",
    "        for line in f:\n",
    "            for ch in ('\\n', '[', ']', '］', '［'):\n",
    "                line = line.replace(ch, '')\n",
    "            stopwords.append(line)\n",
    "    return stopwords\n",
    "\n",
    "# A set keeps the per-word membership test in seg_depart fast\n",
    "stopwords = set(stopwordslist())\n",
    "\n",
    "def seg_depart(sentence):\n",
    "    # Segment the sentence with jieba, drop stop words and tabs, and\n",
    "    # return the remaining words, each followed by a space.\n",
    "    sentence_depart = jieba.cut(sentence.strip())\n",
    "    outstr = ''\n",
    "    for word in sentence_depart:\n",
    "        if word not in stopwords and word != '\\t':\n",
    "            outstr += word + \" \"\n",
    "    return outstr\n",
    "\n",
    "# Segment the training texts in place, printing progress every 100,000 rows\n",
    "sen = dff['text'].values\n",
    "\n",
    "for i in range(len(sen)):\n",
    "    if i % 100000 == 0:\n",
    "        print(i)\n",
    "    sen[i] = seg_depart(sen[i])\n",
    "\n",
    "# Segment the test texts the same way\n",
    "senTest = dfTest['text'].values\n",
    "\n",
    "for i in range(len(senTest)):\n",
    "    if i % 100000 == 0:\n",
    "        print(i)\n",
    "    senTest[i] = seg_depart(senTest[i])\n",
    "\n",
    "from keras.preprocessing.text import Tokenizer\n",
    "from keras.preprocessing.sequence import pad_sequences\n",
    "\n",
    "# Keep only the MAX_NB_WORDS most frequent words.\n",
    "# `num_words` is the current name of the deprecated `nb_words` argument.\n",
    "MAX_NB_WORDS = 20000\n",
    "tokenizer = Tokenizer(num_words=MAX_NB_WORDS, char_level=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The output above is progress information; the cell above takes about five minutes to run**\n",
    "\n",
    "*Almost there*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.fit_on_texts(sen)\n",
    "sequences_test = tokenizer.texts_to_sequences(senTest)\n",
    "MAX_SEQUENCE_LENGTH = 300\n",
    "\n",
    "x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "50000\n",
      "100000\n",
      "150000\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import csv\n",
    "\n",
    "# Predict class probabilities and take the most likely class for each row\n",
    "pred = model.predict(x_test)\n",
    "result = np.argmax(pred, axis=1)\n",
    "\n",
    "# Write the predictions to a CSV file.\n",
    "# newline='' prevents a blank line from appearing between rows.\n",
    "with open('FORCheckResult.csv', 'w', newline='', encoding='UTF-8') as csvFile:\n",
    "    writer = csv.writer(csvFile)\n",
    "    writer.writerow(['ID', 'Expected'])\n",
    "    for i in range(len(result)):\n",
    "        if i % 50000 == 0:\n",
    "            print(i)  # progress indicator\n",
    "        writer.writerow([int(i), int(result[i])])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The predictions of 最高分的训练好的模型.h5 for test.data have been written to FORCheckResult.csv in the current folder\n",
    "\n",
    "#### Thank you for your hard work {heart}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
